STATEMENT OF FOCUS



The Wisconsin Research and Development Center for Cognitive Learning 
focuses on contributing to a better understanding of cognitive learning by
children and youth and to the improvement of related educational practices. 

The strategy for research and development is comprehensive. It includes 
basic research to generate new knowledge about the conditions and processes 
of learning and about the processes of instruction, and the subsequent development
of research-based instructional materials, many of which are designed for
use by teachers and others for use by students. These materials are tested and
refined in school settings. Throughout these operations behavioral scientists, 
curriculum experts, academic scholars, and school people interact, insuring 
that the results of Center activities are based soundly on knowledge of subject 
matter and cognitive learning and that they are applied to the improvement of 
educational practice. 

This Theoretical Paper is from the Language Concepts and Cognitive Skills 
Related to the Acquisition of Literacy Project in Program 1. General objectives
of the Program are to generate new knowledge about concept learning and cognitive
skills, to synthesize existing knowledge, and to develop educational
materials suggested by the prior activities. Contributing to these Program objectives,
this project's basic goal is to determine the processes by which children
aged four to seven learn to read and to identify the specific reasons why many
children fail to acquire this ability. Later studies will be conducted to find
experimental techniques and tests for optimizing the acquisition of skills needed
for learning to read.
ABSTRACT



With the advent of computers in the classroom, the decision structure im- 
plicit in the teaching process has been made obvious. Given a descriptive 
model of the learning process, it is in some cases possible to derive optimal 
conditions for instruction. 

The technique of optimization is illustrated by two examples. In the first, 
the question of the optimal block-size for the learning of foreign language vo- 
cabulary is discussed. The prediction from a two-operator learning model is 
that presentation of the entire list should produce the most efficient learning. 
Despite certain failings of the model, available data are consistent with the 
prediction. 

The second example concerns the application of dynamic programming to 
optimal selection of items in a paired-associate problem. It is shown that, 
given some general constraints on the learning model and the gain function relating
the increase in probability of a correct response to the model, the optimal
decision at any point is to present that item from the list which produces the 
largest immediate increment in response probability. 

The problem of relative efficiency is also taken up, and it is shown that, 
assuming that learning occurs in an all-or-none fashion, a "dropout" procedure 
is considerably more efficient than the standard procedure commonly used if 
presentation of items stops only after a list criterion is reached. The efficiency
of the dropout procedure is shown to depend only on the list criterion.
INTRODUCTION



The digital computer has begun to take 
its place in the classroom as an instructional 
device. Presently, such installations are 
limited mainly to experimental settings (Suppes, 
1966). Although it remains to be seen whether 
computer-aided instructional systems will 
prove either pedagogically or economically 
practical, the problem of efficient utilization 
of such systems is likely to hold the interest 
of psychologists and other behavioral sci- 
entists for some time. 

The computer may be used simply as 
another means of implementing a traditional 
instructional format; e.g., one might simply 
load a programmed text into the machine, 
which could then present the material to the 
student. A more exciting potential of com- 
puter-aided instruction concerns the indi- 
vidualization of instruction, a process in 
which the machine determines each step in 
the instructional sequence on the basis of
past performance using a suitable decision
strategy.

A couple of examples might be helpful. 

One decision strategy might entail selection 
of a presentation rate which best matches the 
student's ability to assimilate information. 
Suppose some material is presented and the 
student is then tested. If he responds cor- 
rectly, the computer goes on to new material; 
if incorrectly, the problem is presented again, 
perhaps in a different form. Thus new ma- 
terial is presented only as fast as the student 
can assimilate it. 

An elaboration of this notion involves 
deciding what "track" to put a student on. 

In the teaching of beginning reading in Amer- 
ican elementary schools, a three-track sys- 
tem is typically used. Classes are divided 
into children of high, middle, or low reading 
ability. Low ability readers use basal pri- 
mers with restricted vocabularies where the 
stress is on rote memorization of words, 
while for high ability readers the vocabularies 
tend to be much larger and there is more em- 
phasis on development of word attack skills 



based on an analysis of letter-sound relations. 
Computer-aided systems could in principle 
continuously evaluate a child's performance 
and change him from track to track as appropriate.
If a child were having a great deal of
trouble during the early stages of reading, he 
would be placed on a low track. As he began 
to perform more adequately, the machine could 
take the necessary steps to move him to the 
next higher track. 

It is easy to imagine in a general way how 
computer-aided systems might be used to in- 
stitute significant innovations in education. 

On the other hand, the consideration of such 
systems has led to analyses of the educational 
process which reveal some fundamental ques- 
tions. It seems safe to say that if a revolution 
in education is brought about in this generation, 
it will result not from the introduction of com- 
puters into the classroom, but from (a) a more 
detailed analysis of specific curriculum 
goals, (b) development of more adequate 
models of the process by which students 
learn, and (c) increased attention to the func- 
tion of the teacher (human or computer) as 
a decision maker. The first of these matters 
is a problem for curriculum specialists; the 
second and third problems are the province of 
the psychologist. Each of these matters con- 
stitutes a fundamental and unsolved problem 
in education. 

For example, the model of the student 
implicit in most educational practices may 
perhaps be best stated as "practice makes 
perfect." The decision rule has been, "if at 
first you don't succeed, try again." The fail- 
ing student is encouraged to repeat the same 
activity until either he "gets it" or runs out 
of time. The adequacy of both the model and 
the decision rule may be called into question. 

When one considers the variety of com- 
binations of ideas about what is to be taught, 
how it is going to be learned, and what de- 
cision policies will be most efficient, it is 
obvious that the odds are very small that 
optimal teaching programs will be arrived at 
by chance. Yet the absence of formal procedures
for the development of teaching programs
and the strong reliance on intuition,
common sense, and previous practice suggest 
that just such a search for a needle in a hay- 
stack has been going on. 

Some efforts at formalization have ap- 
peared during the past few years within a 
specific theoretical framework. Suppose that 
for a particular portion of a curricular program 
(e.g., learning the names of the letters of the
alphabet), there exists an adequate mathemati- 
cal model of the learning process. Then in 
some instances, this model may allow one to
discover unique decision rules which are opti- 
mal in the sense that the student will learn as 
much as possible in a fixed period of time. 

In this paper, two optimization problems 
will be considered: (a) the optimal number of
blocks into which a list of items should be di- 
vided (the whole versus part problem) and 
(b) optimal choice of items as a function of the 
previous response history, which involves the 
question of dropping out "learned" items during 
a training sequence. For both these examples, 
it has been possible to obtain solutions to the 
optimality problem by means of a learning 
model which serves as a plausible approxima- 
tion, albeit a gross simplification. In neither 
example are there sufficient data for empirical 
evaluation of the efficiency of various presen- 
tation strategies. Nevertheless, these two 
problems will illustrate the utility of mathematical
learning models in the derivation of optimal
teaching procedures, and the techniques
as well as the problems of this approach. 



Before proceeding, it may be useful to 
characterize the instructional paradigm more 
explicitly. First, the material to be taught
will be conceptualized as a set of stimulus
items, $S_i$, for each of which there exist one
or more correct responses, $R_i$. These S-R
pairs may be as simple as saying the phoneme
/e/ to the symbol or as complex as
writing an answer to the question, "What
were the major causes of the American Civil 
War?" after reading a 1000-word passage on 
the topic. We also define a history at any 
point in time as the knowledge available to 
the instructor at that time about the pre- 
ceding sequence of S and R events. Then, 
following Groen and Atkinson (1966), the 
succession of events in an instructional 
system can be diagrammed as in Fig. 1. 

Each presentation begins with the selection
of some $S_i$, followed by some response, then
the selection of the next $S_j$, and so forth.

It may be easiest to think of this procedure 
in terms of the learning of a foreign language 
vocabulary, such as acquiring the English 
equivalents of German words. Generally 
speaking, it will be assumed that there are 
unique responses to be associated with each 
stimulus, and that each pair must be presented 
several times for learning to take place. If 
the selection is contingent on the previous 
response history, the system will be referred 
to as dynamic (response-sensitive in Groen 
and Atkinson's terms) while if the selection 
is independent of the history the system will 
be called static (or response-insensitive). 



Figure 1. Flow Diagram of Instructional Sequence (from "Start Instructional Session" to "Terminate Instructional Session")
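
The distinction between static and dynamic systems can be sketched in a few lines of Python. This is only an illustration of the select-respond-record loop described above: the function names (`run_session`, `static_cycle`, `dynamic_repeat_errors`) and both selection rules are hypothetical and do not come from Groen and Atkinson (1966).

```python
from typing import Callable, List, Tuple

# A history is the list of (item index, response correct?) events so far.
History = List[Tuple[int, bool]]


def run_session(n_items: int, n_presentations: int,
                select: Callable[[History, int], int],
                respond: Callable[[int], bool]) -> History:
    """Select an item, record the response, and repeat."""
    history: History = []
    for _ in range(n_presentations):
        item = select(history, n_items)
        history.append((item, respond(item)))
    return history


def static_cycle(history: History, n_items: int) -> int:
    """Static (response-insensitive): uses only the presentation count."""
    return len(history) % n_items


def dynamic_repeat_errors(history: History, n_items: int) -> int:
    """Dynamic (response-sensitive): repeat an item after an error."""
    if not history:
        return 0
    last_item, was_correct = history[-1]
    return (last_item + 1) % n_items if was_correct else last_item
```

Because `static_cycle` never inspects the responses, its whole sequence could be written out before the session starts; `dynamic_repeat_errors` cannot be precomputed in that way.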



II 

STATIC DECISION STRATEGIES 



The major feature of static strategies is 
that the sequence of S-R items can be prepared
prior to the beginning of instruction;
the history is not necessary. As an example 
of a static problem, suppose the number of 
items in a list is quite large so that it appears 
sensible to break it into two or more sublists
or blocks, each of which is taught in turn.
What is the optimum number of blocks or,
equivalently, the optimum block size?

We will consider Suppes' (1964) analysis 
of this problem in some detail, because a solu- 
tion exists and is relatively easy to present. 
The problem can be specified as follows. A 
list of M items is to be divided into blocks of 
k items each. Each block is given N trials 
(i.e., the items in the first block are each
presented N times, then the items in the sec- 
ond block, etc.), and then a final test is ad- 
ministered covering the entire list. The prob- 
lem is to find that value of k which maximizes 
the proportion of correct responses on the 
final test. 

Suppes assumed that in this situation the
learning process could be described by the
following two-operator linear model.¹ Consider
the simplest possible case. Two items
are to be learned; i.e., M = 2. Each time
item 1 is presented and studied by the student,
the probability of error on the next test
of that item is assumed to decrease, whereas
each time item 2 is presented for study, the
probability of an incorrect response for item 1
increases. In short, study presentations for
an item lead to learning of it, and presentations
of other items lead to forgetting of it.

The decrements and increments in error probability
are assumed to be proportional to the
amount remaining to be learned or forgotten,
respectively. Specifically, if $q_{i,n+1}$ is the
error probability for item i on the (n+1)st presentation,
then the effect of studying i on n is

$$q_{i,n+1} = a\,q_{i,n}, \qquad 0 < a < 1. \tag{1}$$

¹In the sections that follow, where derivations
involving specific models of learning
are carried out, it is assumed that the reader
has some background in probability theory or
familiarity with the literature at the level of
Atkinson, Bower, and Crothers (1965).



Suppose $p_{i,n+1}$ is the success probability for
i on n+1. Then the effect of studying some
other item on n is

$$p_{i,n+1} = b\,p_{i,n}, \qquad 0 < b < 1. \tag{2}$$

Since by definition $p_{i,n+1} = 1 - q_{i,n+1}$, by
simple algebra it can be seen that (2) may be
written as $q_{i,n+1} = (1-b) + b\,q_{i,n}$, which will
prove to be a more useful form. The parameters
a and b constitute the learning and forgetting
rates which depend on the student's
ability, the material to be learned, and other
factors. As it turns out, the solution to the
optimization problem depends only on which
is larger, a or b.
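
The two operators are easy to state in code. The sketch below is a minimal illustration, assuming the error-probability form of both operators given above; the function names and the parameter values (a = 0.5, b = 0.9) are hypothetical.

```python
def study(q: float, a: float) -> float:
    """Operator (1): studying the item multiplies its error probability by a."""
    return a * q


def interfere(q: float, b: float) -> float:
    """Operator (2) in error form: presenting another item moves q toward 1."""
    return (1.0 - b) + b * q


# Illustrative values only: one study trial, then one intervening item.
a, b = 0.5, 0.9
q = 1.0
q = study(q, a)       # error probability falls after study
q = interfere(q, b)   # and rises again when another item is presented
```

Writing both operators in terms of q makes it possible to chain them in any order, which is what the block-size analysis requires.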

Equations 1 and 2 (the two operators) are
linear difference equations, the properties of
which are well known (Goldberg, 1961). In
particular, for an equation of the general form

$$x_{t+1} = R\,x_t + S, \tag{3}$$

the solution is

$$x_{t+1} = R^{t} x_1 + S\,\frac{1 - R^{t}}{1 - R}. \tag{4}$$



As an example, suppose item i is studied
for t consecutive trials. For convenience, assume
that initially the chance of giving the
correct answer is zero, i.e., that $q_{i,1} = 1$.
Applying (1), $q_{i,2} = a\,q_{i,1} = a$; $q_{i,3} = a\,q_{i,2} = a^2$;
...; $q_{i,t+1} = a^t$, which agrees with the
solution given in (4) for a = R and S = 0.
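
The agreement between repeated application of (3) and the closed form (4) can also be checked numerically. The following is a small sketch; the function names and test values are illustrative.

```python
def iterate(x1: float, R: float, S: float, t: int) -> float:
    """Apply x <- R*x + S a total of t times, starting from x1."""
    x = x1
    for _ in range(t):
        x = R * x + S
    return x


def closed_form(x1: float, R: float, S: float, t: int) -> float:
    """Solution (4): x_{t+1} = R^t x_1 + S (1 - R^t) / (1 - R), for R != 1."""
    return R ** t * x1 + S * (1 - R ** t) / (1 - R)


# Study-only case from the text: R = a, S = 0, q_{i,1} = 1 gives a^t.
a, t = 0.6, 5
assert abs(iterate(1.0, a, 0.0, t) - a ** t) < 1e-12
```

The same two functions cover the forgetting operator by setting R = b and S = 1 - b.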

We will now present in outline form
Suppes' solution to the problem of optimal
block size based on this model. For convenience,
it may be assumed that $q_{i,1} = 1$ for all
i. Suppose the list is divided into blocks of
size k. Taking a particular item i, at some
point the block containing i receives its series
of N trials, as indicated in the diagram below:



    Block 1:    (k items) Trial 1 | (k items) Trial 2 | ... | (k items) Trial N
      ...
    Block ℓ:    item i, (k-1 items) Trial 1 | item i, (k-1 items) Trial 2 | ... | item i, (k-1 items) Trial N
      ...
    Block M/k:  (k items) Trial 1 | (k items) Trial 2 | ... | (k items) Trial N



The analysis is simplified by assuming
that an item is presented at the same relative
location in the block on each of the N trials.
There will be N-1 complete cycles on each
of which i is presented followed by the
other items in the block. Then i receives its
Nth presentation, the remaining items in the
block are presented, and then there are Nk
presentations for each of the blocks remaining
to be studied. Let $R_i$ denote the number
of presentations of other items following the
last presentation of i.

Consider the events of a single cycle.
Let $q_i^{(j)}$ be the probability of an error for i
at the time of its presentation on trial j.
After i is presented and studied, the new
probability will be $a\,q_i^{(j)}$. Following the k-1
other items in the block (i.e., just prior to
its presentation on trial j+1), from (2) and
(4)

$$q_i^{(j+1)} = a\,b^{k-1} q_i^{(j)} + \frac{(1-b)(1-b^{k-1})}{1-b} = a\,b^{k-1} q_i^{(j)} + (1 - b^{k-1}). \tag{5}$$



However, (5), which incorporates all the
events in a cycle, is itself a linear difference
equation. Hence, following the N-1
cycles in a block, or just prior to the last
presentation,

$$q_i^{(N)} = (ay)^{N-1} + \frac{(1-y)\left[1-(ay)^{N-1}\right]}{1-ay}, \tag{6}$$

where $y = b^{k-1}$. There follows the Nth presentation
of i, and then $R_i$ presentations of
other items, whence the probability of an error
at the time of the final test is

$$q_i^{(F)} = 1 - b^{R_i}\left[1 - a\,q_i^{(N)}\right]. \tag{7}$$

Only the quantity $R_i$ varies from item to item.
The calculation of the average of $q_i^{(F)}$ over
items is not trivial, but the details are not
particularly instructive (but cf. Suppes, 1964).
Letting $v = a/b$ and $w = b^k$, it can be shown
that the expectation of $q_i^{(F)}$ is

$$E(q) = 1 - \left[\frac{(1-b^{MN})(1-a)}{(1-b)M}\right]\left[\frac{1-(vw)^{N}}{1-vw}\right]\left[\frac{1-w}{1-w^{N}}\right]. \tag{8}$$

Only the last two terms in (8) need to be
considered in determining the optimal choice
of k, since only w is a function of k. Notice
that if v = 1 (i.e., if a = b) then the last two
terms cancel. Hence, if the rate parameters
are equal (if the relative amount learned during
study of an item is exactly offset by
presentation of some other item), performance
does not depend on block size, so there is no
optimal block size.

If v < 1, then E(q) decreases monotonically
with k (cf. Appendix A) and so the largest
block size should be used: $k_{\text{opt}} = M$. Note
that v < 1 implies that a < b, so that the rate
of learning when an item is presented for
study is greater than the rate of forgetting
when some other item is presented. Conversely,
if v > 1, E(q) increases with k,
hence $k_{\text{opt}} = 1$.
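
The dependence of expected error on block size can be explored numerically. The sketch below assumes that (8) has the product form $E(q) = 1 - \frac{(1-b^{MN})(1-a)}{(1-b)M}\cdot\frac{1-(vw)^N}{1-vw}\cdot\frac{1-w}{1-w^N}$ with $v = a/b$ and $w = b^k$, and cross-checks it against direct application of the two operators to every item in the list. The function names and parameter values are illustrative, and k is assumed to divide M.

```python
def expected_error(M: int, N: int, k: int, a: float, b: float) -> float:
    """Closed-form mean error on the final test, in the assumed form of (8)."""
    v, w = a / b, b ** k
    common = (1 - b ** (M * N)) * (1 - a) / ((1 - b) * M)
    cycle_term = (1 - (v * w) ** N) / (1 - v * w)
    block_term = (1 - w) / (1 - w ** N)
    return 1 - common * cycle_term * block_term


def simulated_error(M: int, N: int, k: int, a: float, b: float) -> float:
    """Apply operator (1) to the presented item and (2) to all others."""
    q = [1.0] * M                              # q_{i,1} = 1 for all items
    for start in range(0, M, k):               # blocks are taught in turn
        for _ in range(N):                     # N trials per block
            for item in range(start, start + k):
                q = [a * q[i] if i == item else (1 - b) + b * q[i]
                     for i in range(M)]
    return sum(q) / M


# Illustrative case with a < b: expected error falls as blocks get larger.
for k in (1, 2, 3, 6):
    print(k, round(expected_error(6, 3, k, a=0.5, b=0.8), 4))
```

Swapping the parameter values so that a > b reverses the ordering, which is the second case described above.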

According to this model, there are only
two optimal block sizes: either take the entire
list and present it as a single block N
times, or present the first item N times, then
the second item, etc. The choice depends
only on whether a or b is greater. In
fact, only the case where a < b is of more
than passing interest. It is not difficult to
show that if b < a, then letting k = 1 (which
is optimal),

$$\lim_{N \to \infty} E(q) = 1 - \frac{1}{M}. \tag{10}$$



In other words, only one item will be learned.
Since in most learning situations perfect performance
is reached after prolonged practice,
it appears reasonable to assume a < b. One
would therefore predict that whole learning,
as it has been referred to in the psychological
literature, should be more efficient than (or at
least as efficient as) part learning, in
which the list is broken into some number of
blocks. However, the relative efficiency of
a part procedure will depend on the list length
and the number of blocks into which the list
is broken, among other factors.
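
The limiting value for item-by-item presentation can be checked with a short calculation. The steps below are a reconstruction, not taken from Suppes (1964); they assume the product form of the expected error, $E(q) = 1 - \frac{(1-b^{MN})(1-a)}{(1-b)M}\cdot\frac{1-(vw)^N}{1-vw}\cdot\frac{1-w}{1-w^N}$, with $v = a/b$ and $w = b^k$.

```latex
\begin{align*}
k = 1 \;&\Rightarrow\; w = b, \qquad vw = a, \\
E(q) &= 1 - \frac{(1-b^{MN})(1-a)}{(1-b)\,M}
          \cdot \frac{1-a^{N}}{1-a}
          \cdot \frac{1-b}{1-b^{N}}
      = 1 - \frac{(1-b^{MN})(1-a^{N})}{M\,(1-b^{N})}, \\
\lim_{N\to\infty} E(q) &= 1 - \frac{1}{M}
      \qquad \text{since } a^{N},\; b^{N},\; b^{MN} \to 0 \text{ for } 0 < a, b < 1 .
\end{align*}
```

Under these assumptions, every item except the last one studied has been followed by an unbounded number of other presentations, so only the final item contributes a correct response.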

The experimental evidence on the point 
is not especially enlightening. Reports can 
be cited (cf. Osgood, 1953, pp. 540 ff.) in
which the whole procedure was superior, or
the part procedure was superior, or (more
usually) no difference was obtained. Unpublished
work in our laboratory involving relatively
long lists of German-English pairs,
and the extensive investigations by Crothers 
and Suppes (1967) on acquisition of Russian 
vocabulary indicate that a list may be broken 
into two or three blocks with no loss in ef- 
ficiency, and in some instances a gain in 
efficiency is observed (Exps. X and XI, 

Crothers & Suppes, 1967). Breaking a list 
into a large number of small blocks generally 
seems detrimental, so that under certain con- 
ditions the empirically optimal block size 
may be an intermediate value between 1 and 
M, contrary to the predictions of the two-operator
model.

This model is plainly an oversimplifica- 
tion of those processes which allow a student 
to learn a paired-associate list. Certain 
features of the model compromise its usefulness
for prediction of optimal teaching conditions
in all but a few contexts. For example,
in (10) it was shown that if a > b,
at most one item would be recalled after a
large number of trials. This prediction arises
because no matter how many study trials are
given to an item, it is completely forgotten
after the presentation of a large number of
other items. Consider an alternative model
where, in addition to the short-term acquisition
and retention processes described by (1)
and (2), more permanent storage of information
takes place during study trials. This permanent
storage can be expressed by the following
modification of (2):



$$q_{i,n+1} = b\,q_{i,n} + (1-b)\,\gamma^{\,r}. \tag{11}$$



The parameter $\gamma$ is the rate of permanent storage,
and r is the number of preceding study
trials on i. Suppose $q_i^{(r)}$ is the error probability
for i following the r study trials. As
k, the number of other items following the
presentation of i, becomes large, from (1) and
(11)

$$\lim_{k \to \infty} q_i^{(r)} = a\,\gamma^{\,r-1}. \tag{12}$$



The parameter a now effectively represents 
the probability that an incorrect response will 
occur immediately following a study trial. 
Considerable evidence suggests that this
probability is small. Making the assumption 
that a is zero, it is not difficult to show that 



$$E(q) = \gamma^{N}\left\{1 - \left[\frac{1-b^{MN}}{(1-b)M}\right]\left[\frac{1-w}{1-w^{N}}\right]\right\}. \tag{13}$$



The derivative of (13) with regard to k is al- 
ways negative, implying that the whole-list 
procedure will always prove most efficient. 

Another shortcoming in both models is 
that they assume only the simplest interac- 
tions among items within a list. However, 
there are a number of plausible sources of 
interaction: the similarity between the elements
of i and j, the degree of learning of
an item, and the extent to which the student 
decides to attend carefully to one particular 
item rather than another. 

Nonetheless, this example illustrates the 
technique of optimization, and the kinds of 
applications which are possible. Extension 
of these techniques to matters of the sort just 
mentioned awaits the development of more
suitable and comprehensive models. 



DYNAMIC DECISION STRATEGIES 



In this section we consider optimization 
problems which conform to the same S-R 
paradigm as before, but where the student's 
history in the session is used in choosing 
the sequence of presentation. The earliest 
paper on this topic is by Smallwood (1962).
He posed the optimization question in a general
form and introduced the technique of
dynamic programming into the learning lit- 
erature. His ideas about learning models 
were not as well developed, and his paper 
will not be considered here in any detail. 

Instead, we will look at a question con- 
sidered recently by several investigators 
(Groen & Atkinson, 1966 ; Karush & Dear, 1966 ; 
Matheson, 1964). The single-operator linear 
or incremental and all-or-none models have 
come to be used as standards of a sort (whip- 
ping boys might be more accurate) in the field 
of learning models. Both are extremely simple 
one-parameter models which are patently in- 
capable of handling the more complex aspects 
of human verbal learning. However, they do 
make different predictions about certain fea- 
tures of learning data, and, as it turns out, 
they lead to quite different optimal presenta- 
tion strategies. 

The incremental model is a simplifica- 
tion of the two-operator model previously 
introduced, in which forgetting is assumed 
to be a negligible factor; i.e., only permanent
learning is taken into account. Specifically,
following the nth study trial,

$$q_{i,n+1} = a\,q_{i,n}, \tag{14}$$

which is equivalent to (1), and so from (4)

$$q_{i,n+1} = a^{n} q_{1}, \tag{15}$$



where $q_1$ is the initial error probability, which
for present purposes will be assumed to be
generally less than 1. Notice that (14) is
applied on every study trial, that application
of (14) is independent of the student's response
on the preceding test, and that the
response gives no information about the state
of learning of the item. If the item has received
n study trials, then (15) holds, independent
of the preceding response history.
Note also the implicit assumption that a is
the same for all items in the list so that $q_n$,
the average probability of an error on n, is
also equal to (15).

The all-or-none model can be characterized
as a two-state Markov process with
an absorbing state. It is assumed that each
item may be in one of two states: U, an unlearned
state in which nothing has been learned
about the correct response, and L, a learned
state in which the correct response is always
given. It may help to think about this model
in terms of the utilization of mnemonic codes.
On each trial, the student searches through
his memory for anything which will help him
remember the S-R association. Before such
a mnemonic link is discovered the student
will be in U. Once a link is found for an
item, learning is complete for that item. The
model can be described by a transition matrix
and column vector of response probabilities.



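$$
\begin{array}{c|cc|c}
 & \multicolumn{2}{c|}{\text{State on trial } n+1} & \\
\text{State on trial } n & L & U & P(\text{Incorrect} \mid \text{Row State}) \\
\hline
L & 1 & 0 & 0 \\
U & c & 1-c & q_1
\end{array}
$$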



On each study trial, if an item is still in U,
then with probability c it is learned; otherwise
it remains in U. If an item is in L, it
remains there. (As in the incremental model,
there is no forgetting.) It is assumed that the
parameter c is the same for all items. The
guessing parameter $q_1$ is usually taken to
be the reciprocal of the number of available
responses.

Suppose that $q_n$ is the average number
of items in U on trial n. Then, on the average,
a proportion c of the items will change to L
on trial n+1, and 1 - c will remain in U.
Therefore

$$q_{n+1} = (1-c)\,q_n,$$

which is identical in form to (14) if 1 - c is
substituted for a, and hence from (15),

$$q_{n+1} = (1-c)^{n} q_1. \tag{16}$$

In short, although the two models are based 
on different ideas about the psychological 
processes, the same mean learning curve is 
predicted by both. 
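
The equality of the two mean learning curves can be illustrated directly. The sketch below computes both curves exactly rather than by simulation; the function names and the values c = 0.3 and $q_1$ = 0.75 are illustrative assumptions.

```python
def linear_curve(a: float, q1: float, n_trials: int) -> list:
    """Incremental model: every study trial multiplies the error by a."""
    return [a ** n * q1 for n in range(n_trials)]


def all_or_none_curve(c: float, q1: float, n_trials: int) -> list:
    """All-or-none model: error = P(item still in U) * guessing error q1."""
    curve, p_unlearned = [], 1.0
    for _ in range(n_trials):
        curve.append(p_unlearned * q1)
        p_unlearned *= 1 - c          # one study trial leaves U w.p. 1 - c
    return curve


c, q1 = 0.3, 0.75
same = all(abs(x - y) < 1e-12
           for x, y in zip(linear_curve(1 - c, q1, 8),
                           all_or_none_curve(c, q1, 8)))
```

The mean curve alone therefore cannot separate the two models; response-dependent statistics are needed for that.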

There are some important differences in 
the predictions of the two models, however. 
Those differences hinge upon the response- 
dependent character of the all-or-none model. 
Specifically, an error is an observable re- 
current event in this model; if the student 
makes an incorrect response on some item, 
then the item must have been in U , and none 
of the preceding study trials had any effect 
on the likelihood of a correct response for 
that item. Comparison of response-dependent 
statistics from the two models points up the 
difference clearly. Consider the likelihood
that an error on trial n is followed by an error
on trial n+1, $\Pr(e_{n+1} \mid e_n)$. According to the incremental
model, the error probability on trial n+1
depends only on the fact that there were n
previous study trials, so that $\Pr(e_{n+1} \mid e_n) = a^{n} q_1$.
In the all-or-none model, on the other
hand, the error on n indicates that the item
had not been learned up to that point. For
an error to occur on the next trial, it must be
the case that, with probability 1 - c, the nth
study trial was also ineffective, and that the
student failed to guess the correct answer, with
probability $q_1$. Putting these two events together,
for the all-or-none model, $\Pr(e_{n+1} \mid e_n) = (1-c)\,q_1$.
This probability is not a function of
the trial number; it is predicted to remain
constant throughout learning by the all-or-none
model.

Thus, the conditional probability of an 
error is predicted to remain constant in the 
all-or-none model and to decrease at an ex- 
ponential rate by the incremental model. In 
general, data from verbal learning experiments 
are closer to the constancy predicted by the 
all-or-none model than the exponential de- 
crease predicted by the linear model, although 
a slight decrease is typically observed 
(Calfee, Atkinson, & Shelton, 1965).
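
The constancy of the conditional error probability under the all-or-none model can be seen in a small Monte Carlo sketch. The simulation below is illustrative only: it assumes each trial consists of a test followed by a study opportunity, and the function names, seed, sample size, and parameter values are all hypothetical.

```python
import random


def simulate_all_or_none(c, q1, n_trials, n_subjects, seed=0):
    """Return one list of error indicators (True = error) per simulated item."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_subjects):
        learned, row = False, []
        for _ in range(n_trials):
            # Test: errors occur only in U, with guessing-error probability q1.
            row.append((not learned) and rng.random() < q1)
            # Study: an unlearned item moves to L with probability c.
            if not learned and rng.random() < c:
                learned = True
        records.append(row)
    return records


def conditional_error(records, n):
    """Estimate Pr(error on trial n+1 | error on trial n); n is 1-based."""
    following = [row[n] for row in records if row[n - 1]]
    return sum(following) / len(following)


records = simulate_all_or_none(c=0.3, q1=0.75, n_trials=6, n_subjects=100_000)
estimates = [conditional_error(records, n) for n in range(1, 5)]
```

Each entry of `estimates` should lie close to (1 - c)·$q_1$ = 0.525 regardless of n, whereas the incremental model with a = 1 - c would predict $a^n q_1$, which shrinks with n.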



Now we turn to the question of optimal 
decision strategies based on these two models. 
The problem will be posed in the following
way. Suppose that a fixed number of S-R 
presentations are to be given, and that the 
entire sequence of successes and errors for 
each item in the list is available. How may 
this history be used to choose an item on each 
presentation such that learning is optimal? 

The solution based on the all-or-none model 
can be understood intuitively as follows. 
Presentation of learned items is a waste of
time, because once an item is learned it remains
learned. Hence, the problem becomes
one of locating that item which has the best
chance of being in the unlearned state. Errors 
provide this information, in the sense that 
they identify an item as being in U at the 
time that the error occurs. The more successes 
since the last error, the less likely an item is 
to be in U. Hence for the all-or-none model, 
the optimal decision strategy is to present 
that item with the fewest successes since 
the last error.
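
The selection rule just stated is simple to express in code. The sketch below is a minimal illustration: each item's history is assumed to be a list of booleans (True for a success), and the function names and the German vocabulary items are hypothetical.

```python
def successes_since_last_error(responses):
    """Count the run of correct responses at the end of an item's history."""
    count = 0
    for correct in reversed(responses):
        if not correct:
            break
        count += 1
    return count


def choose_item(histories):
    """Pick the item with the fewest successes since its last error.

    Ties are broken by dictionary order, an arbitrary choice in this sketch.
    """
    return min(histories,
               key=lambda item: successes_since_last_error(histories[item]))


histories = {
    "Haus": [False, True, True],   # two successes since the last error
    "Baum": [True, False],         # error on the most recent presentation
    "Hund": [False, True],         # one success since the last error
}
next_item = choose_item(histories)   # "Baum"
```

Only the tail of each history matters, which reflects the point made above that an error identifies the item as unlearned at that moment.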

The technique used to obtain a formal 
solution of the problem (as opposed to the 
intuitive argument given above) is dynamic 
programming (Bellman, 1957). Suppose a
sequence of D decisions must be made, and
it is desirable that the sequence of decisions
yield an optimal outcome of some sort. Consider
the last or Dth decision first. That
choice should be made that produces the
greatest immediate gain since there are no
more decisions to be made. Now take the
(D-1)th decision. The optimal decision at
this point is the one which maximizes the
gain to be collected from that decision and
the last one, assuming that the last decision 
is optimal. The technique involves working 
backward in this fashion, always making that 
decision which yields the highest return from 
that point on, assuming the remaining deci- 
sions are optimal. 
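
The backward argument can be sketched as a short recursive program. This is a generic finite-horizon illustration, not the formulation used by any of the cited authors: it uses memoized recursion, which evaluates the last decision first in the same way as working backward. The toy gain function assumes the incremental model with two items, and all names and parameter values are hypothetical.

```python
from functools import lru_cache


def plan_decisions(n_decisions, actions, step, gain, start):
    """Return (best total gain, best action sequence) over a fixed horizon."""

    @lru_cache(maxsize=None)
    def best_from(d, state):
        if d == n_decisions:               # no decisions remain
            return 0.0, ()
        candidates = []
        for action in actions:
            # Value of the remaining decisions, assuming they are optimal.
            future, tail = best_from(d + 1, step(state, action))
            candidates.append((gain(state, action) + future, (action,) + tail))
        return max(candidates)

    return best_from(0, start)


# Toy example: state = number of study trials already given to each item.
Q0 = (0.9, 0.5)    # initial error probabilities (illustrative)
A = 0.6            # incremental-model parameter a (illustrative)


def step(state, item):
    return tuple(n + 1 if j == item else n for j, n in enumerate(state))


def gain(state, item):
    # Drop in error probability produced by one more study trial on item.
    return Q0[item] * A ** state[item] * (1 - A)


value, plan = plan_decisions(3, (0, 1), step, gain, (0, 0))
```

In this toy case the best plan gives two presentations to the item with the larger initial error and one to the other, for a total gain of about 0.776.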

In the learning problem, the choice of a 
particular item constitutes the decision, and 
the transition of an item to L comprises the 
gain. (Note that this unobservable gain is
paid off at the time of the final test.) The
application of dynamic programming is simplified
if the following Markovian property
of the gain function holds: the payoff arising
from the last k decisions depends only on
(a) the accumulated gain just prior to the first 
of these decisions and (b) whatever the last 
k decisions are. In particular, the payoff 
from the last k decisions must not depend 
upon the particular manner in which the gain 
in (a) was accumulated. Since in our problem
the gain is dependent only on performance on
the final test trial, the Markov property clearly
holds.²

While the gain or payoff is not realized 
until the final test, it is possible to estimate 
the probability of gain for each item that 
might be presented on any trial. This gain 
is the probability that an item is transferred 
to L if it is presented. If the item is still
in U, then if it is chosen for presentation the
likelihood of a transfer is c. If it is in L,
then the gain is necessarily zero. Thus we 
want to find that item with the highest prob- 
ability of being in U. All events prior to the 
last error can be completely disregarded, 
since at that point we know the item was in 
U and none of the preceding study trials had 
been effective. It is not difficult to show that 
the more successes since the last error, the 
smaller the probability that an item is in U 
(Appendix B), Hence, the largest payoff on 
any trial is obtained by presenting that item 
whose history indicates the fewest successes 
since the last error. 
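
The selection rule just described can be illustrated with a short
sketch. It assumes the all-or-none model with learning rate c and a
chance success rate g for unlearned items (g corresponds to 1 - q_1 in
Appendix B); the function names and the list-of-booleans history format
are illustrative assumptions, not part of the original report.

```python
def prob_unlearned(k, c, g):
    """Posterior probability that an item is still in U after k successes
    since its last error (all-or-none model; assumes c > 0).

    Assumed trial structure: the last error is followed by a study trial,
    then k (test, study) pairs, so k + 1 study trials in all.
    """
    r = g * (1.0 - c)
    # Item learned on study trial j (1 <= j <= k): the first j - 1 tests were
    # lucky guesses, and every later test is correct with certainty.
    learned_early = sum(c * r ** (j - 1) for j in range(1, k + 1))
    # Item not learned during the first k study trials: all k tests were guesses.
    all_guessed = r ** k
    p_k = learned_early + all_guessed
    # Still in U after all k + 1 study trials, with k lucky guesses.
    p_u_and_k = (g ** k) * (1.0 - c) ** (k + 1)
    return p_u_and_k / p_k


def successes_since_last_error(history):
    """Count trailing correct responses in a list of booleans (True = correct)."""
    count = 0
    for correct in reversed(history):
        if not correct:
            break
        count += 1
    return count


def choose_item(histories):
    """Index of the item with the fewest successes since its last error."""
    return min(range(len(histories)),
               key=lambda i: successes_since_last_error(histories[i]))
```

For any fixed c and g, `prob_unlearned` falls as k grows, which is why
picking the item with the smallest run of successes also picks the item
most likely to be in U.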

The next problem is to determine the conditions under which selection
of the item which yields the highest (expected) payoff is also an
optimal decision strategy. That is, under certain conditions, choice of
an item which has the highest payoff on a particular presentation is
not necessarily the optimal choice in the long run. It is useful to
introduce the concept of largest immediate gain (LIG) to refer to the
maximum increment in payoff on any single trial, disregarding the
potential consequences of the selection at any later stage of the
process. If the learning process can be described as a Markov process
with observable states, and given some minimal constraints on the gain
function, then it can be shown that the LIG decision



^As an example of a non-Markovian situation, suppose that the cost of
presenting items to the student enters into the evaluation, so that we
are trying to maximize amount learned and also minimize the cost of
operating the system. Assume also that the cost of presenting item i
depends on what item had just been presented, which might be true if
the items are messages on a serial device such as a magnetic tape
recorder. Substantial cost in time between presentations might be
incurred if the items are not close to each other. In this instance the
payoff on presentation n+1 depends on the payoff on n and the item just
presented.



is an optimal decision under dynamic programming (Appendix C).

As mentioned above, for the all-or-none model, the LIG decision is to
present the item with the fewest successes since the last error. For
the incremental model, the gain associated with item i is the increment
in p_i which accrues by presenting i. If an item has previously
received n presentations, then its gain will be

    (1 - a^{n+1} q_1) - (1 - a^n q_1) = a^n (1 - a) q_1.        (17)

In the incremental model, the gain depends only on how many times an
item has been presented and is not a function of the response history.
Since (17) decreases as n increases, the LIG decision is to present
that item with the fewest presentations. Equivalently, an optimal
strategy is to present each item in the list once, then cycle through
the list a second time, etc., until the available number of
presentations has been used up.
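
The relation between the LIG rule and backward induction can be checked
numerically on a small case. The sketch below assumes the incremental
model with the per-presentation gain of (17) and a total payoff equal
to the sum of the gains collected; the function names and the tiny
3-item examples are illustrative assumptions only.

```python
from functools import lru_cache


def immediate_gain(a, n, q1=1.0):
    """Gain from one more presentation of an item already presented n times, as in (17)."""
    return (a ** n) * (1.0 - a) * q1


def greedy_schedule(a_params, start_counts, decisions):
    """LIG rule: at every step present the item with the largest immediate gain."""
    counts = list(start_counts)
    total = 0.0
    order = []
    for _ in range(decisions):
        i = max(range(len(counts)),
                key=lambda j: immediate_gain(a_params[j], counts[j]))
        total += immediate_gain(a_params[i], counts[i])
        counts[i] += 1
        order.append(i)
    return total, counts, order


def optimal_total_gain(a_params, start_counts, decisions):
    """Backward induction over (presentation counts, decisions remaining)."""
    @lru_cache(maxsize=None)
    def value(counts, remaining):
        if remaining == 0:
            return 0.0  # no decisions left, nothing more to collect
        best = 0.0
        for i, n in enumerate(counts):
            nxt = counts[:i] + (n + 1,) + counts[i + 1:]
            best = max(best,
                       immediate_gain(a_params[i], n) + value(nxt, remaining - 1))
        return best

    return value(tuple(start_counts), decisions)
```

With equal parameters the greedy rule reduces to cycling through the
list, and with unequal parameters its total gain matches the value
found by backward induction in the small cases tried here.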

In summary, the optimal decision strategy for the incremental model is
to adopt the standard procedure used in most paired-associate studies:
present the list of items in random order for as many trials as are
available. The all-or-none model implies that a form of dropout
strategy should be most efficient. An item that has reached a criterion
of k successes since the last error should not be presented again until
all other items have reached the same criterion.

Research on the relative effectiveness of the dropout procedure is
limited. The dropout procedure was apparently introduced half a century
ago by Woodworth (1914) but has been only intermittently employed
since, mostly as a means of equating the learning of individual items
(e.g., Madden, Adams, & Spence, 1950). (Interestingly enough, Woodworth
concluded that even when the dropout procedure was used, the more
quickly an item reached criterion, the more likely it was to be
recalled on subsequent tests.) Battig (1965) argued for the dropout
procedure as a remedy for various technical difficulties in the
paired-associate method and mentioned that in one experiment the
dropout procedure "require[d] significantly fewer errors (as well as
less time and fewer total item presentations) to reach a criterion of
an errorless trial. . .[p. 7]." No data were reported.

Dear, Silberman, Estavan, & Atkinson (1967) have reported two studies
designed to evaluate the relative efficiency of the dropout procedure.
The stimuli were 32 distinctive 2-digit numbers. The correct response
to each stimulus was one of 4 pushbutton switches. Each response was
the correct response for a different set of 8 stimuli, so the task was
in some ways analogous to a concept identification problem. The list of
stimuli was divided into two sets of 16 items. For each subject, items
from one set were presented on alternate trials in the standard
fashion, while on the remaining trials, items from the other set were
presented using the dropout technique. In the first study, each of 81
subjects was run for 640 anticipation (test-study) presentations,
followed by 3 test cycles through all 32 items in the list. (For 44 of
these subjects, there were 2 responses instead of 4, but there was no
major difference between the groups in the pattern of results.) The
number of correct responses on the final tests was slightly higher for
dropout items than for standard items, but the difference was not
statistically reliable. This was in spite of the fact that a
theoretical analysis (assuming that learning was all-or-none) indicated
that about 75% of the dropout items should have been learned, but only
45% of the standard items.

In the second study, in which 36 students participated, the only major
change was that the training period was terminated when a correct
response was given to 10 of the 16 items being presented in a standard
manner. Test performance on the standard items was significantly better
than on the dropout items. From these results, Dear et al. were led to
a rather pessimistic conclusion about the efficiency that might be
achieved by variation in presentation strategies.

It may be premature to give up hope too quickly, however, especially in
light of the other studies referenced above. It would seem that the
dropout procedure should have certain advantages over the standard
procedure if for no other reason than that certain items are often
easier to learn than others. Thus, in learning a foreign language, less
time should be spent on cognates than on unfamiliar items and
pseudo-cognates. Dear et al. attempted to select a set of
stimulus-response pairs of about equal difficulty, and so the comment
above may not be especially germane to their results.

The two models used for optimization are both incomplete in one obvious
respect, viz., neither incorporates a mechanism to handle forgetting,
which arises from the efforts of the student to learn other items in
the list. Models exist which provide representation for both learning
and forgetting processes (Atkinson & Crothers, 1964; Calfee & Atkinson,
1965). However, it has proven difficult to derive optimal strategies
based on these more complex models.

As an alternative to the use of a more complicated model, one might
change the experimental procedure in certain ways. The anticipation
technique used by Dear et al. has the undesirable feature that once an
item reaches some performance criterion in the dropout procedure, it is
not likely to be presented for study for some time, and consequently no
tests are run for that period. If for any reason the item is forgotten,
the teaching system remains insensitive to the loss. A more appropriate
experimental procedure might involve separation of test and study
events. Every item would be tested at regular intervals, but the choice
of items for study would be determined by an optimal decision strategy
(also, cf. Battig, 1965). As an alternative, the anticipation technique
might be used but the teaching system might keep a running tally for
each item of how many trials have passed since it was last presented.
After some maximum was reached, an item would be presented for review.
Both of these techniques would produce a better history on which to
base decisions, a history which would be less likely to be misleading
because of short-term memory and chance guessing effects.
IV

EFFICIENCY OF DECISION STRATEGIES 



Calculation of the theoretical efficiency of various decision
strategies has proven rather difficult. The application of dynamic
programming, though it allows one to find an optimal strategy, does not
generally produce a closed-form expression for the gain in efficiency
of that strategy over any alternative that might be proposed. In order
to get an idea of the relative merits of the dropout procedure, a
series of computer simulations was carried out. In each simulation run,
Monte Carlo data were generated for 100 stat-students, each learning a
10-item list. Independence of items was assumed, and the initial
guessing rate was set at .1. An anticipation technique was used, and to
get the process started there were two preliminary trials or cycles
through the entire list. N additional presentations were then allotted
to various items according to some particular decision strategy. The
efficiency of the strategy was determined after the N presentations by
computing the average number of items in L for the all-or-none model,
or the mean of q for the incremental model.

Three decision strategies were evaluated: 
(a) the dropout procedure, (b) the standard 
trial procedure, and (c) a strategy based on 
an intuition that difficult items would have 
relatively more errors, hence items should be 
presented relatively in proportion to the error 
rate. This last strategy which appeared to be 
reasonable on intuitive grounds was included 
for comparison with the solution derived from 
a more formal approach, dynamic programming. 
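
A minimal sketch of how such a simulation might be set up is given
below for strategies (a) and (b) only. It assumes the all-or-none
model, an anticipation trial consisting of a test followed by a study
opportunity, a fixed cyclic order for the standard procedure, and
random tie-breaking in the dropout procedure; these details, the
function names, and the seeds are assumptions, since the original
simulation program is not described at this level.

```python
import random


def successes_since_last_error(history):
    """Count trailing correct responses in a list of booleans."""
    count = 0
    for correct in reversed(history):
        if not correct:
            break
        count += 1
    return count


def run_student(strategy, c, n_extra, rng, n_items=10, guess=0.1):
    """Simulate one stat-student; return the proportion of items in L at the end."""
    learned = [False] * n_items
    history = [[] for _ in range(n_items)]

    def present(i):
        # Test: a learned item is always correct, an unlearned one only by guessing.
        correct = learned[i] or rng.random() < guess
        history[i].append(correct)
        # Study: an unlearned item moves to L with probability c.
        if not learned[i] and rng.random() < c:
            learned[i] = True

    for _ in range(2):                      # two preliminary cycles through the list
        for i in range(n_items):
            present(i)
    for t in range(n_extra):                # N additional presentations
        if strategy == "standard":
            i = t % n_items
        else:                               # "dropout": fewest successes since last error
            runs = [successes_since_last_error(h) for h in history]
            fewest = min(runs)
            i = rng.choice([j for j, r in enumerate(runs) if r == fewest])
        present(i)
    return sum(learned) / n_items


def simulate(strategy, c, n_extra, n_students=100, seed=0):
    """Mean proportion of items in L over n_students simulated students."""
    rng = random.Random(seed)
    results = [run_student(strategy, c, n_extra, rng) for _ in range(n_students)]
    return sum(results) / n_students
```

Under these assumptions the standard procedure with c = .1 and N = 80
gives each item 10 presentations, so its expected proportion in L is
1 - .9^10, about .65, which is in line with the corresponding entry in
Table 1.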

The first series of simulation studies in- 
vestigated the efficiency of the dropout pro- 
cedure compared with the standard procedure, 
assuming an all-or-none model and identical 
learning rate parameters for all items. Strat- 
egy (c) was also included for comparison. 

The results are shown in Table 1 for values 
of c from .3 to .025, a range typical of the 
results of paired-associate experiments, and 
for N equal to 50, 80, and 160. 



Table 1. Proportion of Items in L at End of Training Following
         Standard, Dropout, and Error-Proportion Decision Strategies
         (Simulation Data Based on All-or-None Model, Homogeneous
         Parameters)

                          Decision Strategy
                 ------------------------------------
    N      c     Standard    Dropout    Error-
                                        Proportion

   50     .3       .91        1.00        .97
          .1       .53         .65        .56
          .05      .33         .31        .30
          .025     .16         .17        .16

   80     .3       .98        1.00       1.00
          .1       .62         .83        .70
          .05      .36         .48        .44
          .025     .22         .26        .24

  160     .3      1.00        1.00       1.00
          .1       .85        1.00        .96
          .05      .61         .80        .69
          .025     .38         .46        .36

These data, while limited, provide some idea of the relative efficiency
of the various procedures. When there is a difference, the dropout
procedure is most efficient, the standard procedure least efficient;
the strategy based on error proportions falls in between. It appears
that the efficiency of the dropout procedure depends on both c and N.
For certain combinations of these variables, the dropout procedure is
considerably more efficient (e.g., c = .1, N = 80). However, if N is
large and if learning is fast, then all the items will be in L at the
end of training; conversely, if the number of available presentations
is small relative to the learning rate then not many items will have
been learned, regardless of the presentation procedure. For certain
intermediate conditions which cannot be described in a simple fashion,
substantially more items will be learned if the dropout procedure is
used.

One slight change in procedure provides a more certain estimate of
efficiency. The assumption of a fixed number of presentations
facilitated the application of dynamic programming to the optimization
problem. However, suppose that instead of having a fixed number of
presentations, the session is terminated only after some proportion of
items in the list have reached a predetermined performance criterion
(so many successes in a row).^ The performance criterion should reflect
the chance guessing rate: the higher the probability of being correct
by chance, the more stringent should be the criterion.

Again, assume that learning proceeds in an all-or-none fashion. The
discussion will be simplified if q_1 is assumed to be 1; the general
result can be obtained in closed form, but is less informative than the
result below, which will depart little from the general result if q_1
is reasonably close to 1. Suppose the proportion of items which must be
learned (i.e., reach the performance criterion) is set at some value,
x. Taking the standard procedure first, on the average we can expect
some proportion of the items to be learned on trial 1, an additional
proportion on trial 2, etc. At some critical trial, the proportion of
items learned will equal x. For the standard procedure, let k_S be the
mean number of presentations required before the last error.^ The



^It should be noted that two criteria have been mentioned: (a) a
performance criterion for each item of some number of consecutive
successes, and (b) a list criterion for terminating the training
session. Criterion (b) is attained when a proportion x of the items in
the list have reached the performance criterion (a).

^For convenience in deriving results for this section, we are
considering the number of presentations to the last error, not the
number of presentations required to complete the criterion run. Since
these criterion trials add a constant to the numerator and denominator
of the efficiency ratio, k_S/k_D, an efficiency ratio based on trials
to criterion will be attenuated to an extent that depends on the length
of the performance criterion and the actual values of k_S and k_D.



distribution of E, the presentation of last error, is

    Pr(E = n) = (1 - c)^{n-1} c.        (18)

For the last error to occur on the nth presentation, it must be the
case that learning did not occur on n - 1 previous opportunities, but
then occurred on the nth presentation. The problem is to find k_S such
that

    x = \sum_{i=1}^{k_S} Pr(E = i) = \sum_{i=1}^{k_S} c (1 - c)^{i-1}
      = 1 - (1 - c)^{k_S}.        (19a)

Rearranging terms and taking logarithms to base 10,

    k_S = \frac{\log (1 - x)}{\log (1 - c)}.        (19b)



In the dropout procedure, an item is presented only until it reaches
criterion. Thus, on the average a proportion of the items equal to
Pr(E = 1) will be presented once, a proportion Pr(E = 2) twice, and
finally the last item reaching the list criterion will require k_S
presentations. The average number of presentations for that portion of
the list which reaches criterion will therefore equal

    \sum_{i=1}^{k_S} i Pr(E = i) = \sum_{i=1}^{k_S} i c (1 - c)^{i-1}
      = \frac{1 - (1 - c)^{k_S} (1 + k_S c)}{c}
      = \frac{1 - (1 - x)(1 + k_S c)}{c}.        (20)

It is not exactly clear, given completely random selection of those
items remaining to be learned in the dropout procedure, just what the
distribution of the number of presentations would be for the unlearned
items which remain when the list criterion is finally reached. We
should not be too far off, however, if we assume that the noncriterion
items will have received about as many trials as the last items to be
learned, viz., k_S trials. Finally, then, for a proportion x of the
items in the dropout procedure the mean number of presentations will be
given by (20), and for the remaining 1 - x of the items the number of
presentations will be approximately k_S. Combining these results, the
mean number of presentations per item using the dropout procedure, k_D,
will be

    k_D = \frac{1 - (1 - x)(1 + k_S c)}{c} + (1 - x) k_S.        (21)



For that range of c which is of interest, .3 > c > .01, log (1 - c) is
very nearly equal to -c/2. From (19b), substitution of this
approximation for log (1 - c) yields

    k_S \approx \frac{-2 \log (1 - x)}{c},        (22a)

from which is obtained the final result,

    \frac{k_D}{k_S} \approx \frac{x}{-2 \log (1 - x)}.        (22b)
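
The step from (21) to (22b) is short and may be worth spelling out. The
following is a worked simplification using only the quantities defined
above (x, c, k_S, k_D); it expands the numerator of (21), after which
the two terms in k_S cancel.

```latex
\begin{align*}
k_D &= \frac{1 - (1 - x)(1 + k_S c)}{c} + (1 - x)\,k_S \\
    &= \frac{x}{c} - (1 - x)\,k_S + (1 - x)\,k_S \;=\; \frac{x}{c}, \\[4pt]
\frac{k_D}{k_S} &\approx \frac{x/c}{-2\log(1 - x)/c}
                 \;=\; \frac{x}{-2\log(1 - x)} .
\end{align*}
```

The learning rate c enters k_D and the approximation (22a) for k_S in
the same way, which is why it drops out of the ratio.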



In short, the relative efficiency of the dropout procedure depends only
on x, the proportion of items which must reach the performance
criterion for a session to be terminated, and does not depend on the
learning rate within the boundary values mentioned. From (22b), the
dropout procedure requires 57% as many presentations as the standard
procedure for x = .8; for a more stringent list criterion, x = .95, the
dropout procedure requires only about 37% as many presentations.
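
The efficiency ratio can be evaluated directly. The sketch below
computes k_S from (19b), k_D from (21), and the approximation (22b),
using base-10 logarithms as in the text; the function names are
illustrative assumptions.

```python
import math


def k_standard(x, c):
    """Mean presentations to last error under the standard procedure, from (19b)."""
    return math.log10(1.0 - x) / math.log10(1.0 - c)


def k_dropout(x, c):
    """Mean presentations per item under the dropout procedure, from (21)."""
    ks = k_standard(x, c)
    return (1.0 - (1.0 - x) * (1.0 + ks * c)) / c + (1.0 - x) * ks


def ratio_exact(x, c):
    """k_D / k_S without the log(1 - c) = -c/2 approximation."""
    return k_dropout(x, c) / k_standard(x, c)


def ratio_approx(x):
    """k_D / k_S from (22b); depends on x only."""
    return x / (-2.0 * math.log10(1.0 - x))


if __name__ == "__main__":
    for x in (0.8, 0.9, 0.95):
        print(x, round(ratio_approx(x), 3),
              [round(ratio_exact(x, c), 3) for c in (0.3, 0.1, 0.01)])
```

Comparing `ratio_exact` with `ratio_approx` over the stated range of c
gives a sense of how much the -c/2 approximation costs; the exact ratio
moves somewhat with c, while (22b) does not.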



The argument above strictly holds only if the list length is infinitely
large. For lists of finite size, the expected value of k_S obtained by
(19a) may not be exactly correct. (Estes, 1959, pp. 36 ff., gives the
distribution function for k_S for finite lists, but the expected value
involves the incomplete sum of a binomial distribution.) The
approximation in (22b) was evaluated by computer simulation for 10-item
lists and for various values of x and c. Agreement between the obtained
and predicted efficiency ratios was good except for x = 1.0, in which
case the ratio appeared to be approaching a value of about .4. A
reasonable conjecture would seem to be that (22b) holds up for short
lists for all criteria up to (M-1)/M, but not for x = 1. Simulations
also showed that for q_1 = .9, the approximation from (22b) was quite
accurate.

In summary, given a fixed number of pre- 
sentations, the dropout procedure was shown 
by application of dynamic programming to be 
optimal. The relative efficiency of the drop- 
out procedure compared to the standard pro-
cedure varied in a complex fashion with the
learning rate and the number of presentations. 
Under certain conditions, final test perform- 
ance was measurably higher under the dropout 
procedure, but other times the difference in 
procedures was quite small. By requiring that 
a session last until a criterion proportion of 
items has been learned, the relative efficiency 
of the dropout procedure was determined to a 
good approximation and was found to depend 
only on the criterion. If the list criterion was 
90% or higher, the dropout procedure should 
require less than half as many presentations 
as the standard procedure — a substantial 
gain in efficiency. 



V 

HETEROGENEITY OF LEARNING RATE PARAMETERS 



If one assumes that some items are easier to learn than others, then
the dropout procedure might prove efficient because it provides more
practice on difficult pairs. It seemed worth investigating this
possibility in the context of both models. Assume that a list with two
items is to be learned, and that item 1 is much easier to learn than
item 2. In the all-or-none model, this is equivalent to c_1 > c_2. The
dynamic programming solution handles this case without any need to
estimate c_1 and c_2; the easier item will go to L relatively early,
and the majority of the presentations will be given to item 2.^

Next consider the incremental model, where a_1 < a_2. If the two
parameters are

^If the parameters c_1 and c_2 were known prior to the experiment, the
optimal decision strategy would actually be more complicated. The
nature of this complication may be seen by considering the decision
rule if the last response to both items was an error. If c_1 = c_2,
there is no optimal decision. If c_1 > c_2, then item 1 should be
presented, since the expected gain, c_1, is greater than for item 2.
More generally, the a posteriori probability (based on successes since
last error) that each item is in U is not sufficient to make the
decision, but must be weighted by the learning rate parameter. The
strategy of the dynamic programming solution in this instance is to
teach the easy items first. In most instances, however, no a priori
estimate of the learning rate parameters will be available, unless the
assumption is made that the relative difficulty of an item can be
estimated over a group of students. Smallwood (1962) considers this
possibility in a different context.



known a priori, then the dynamic programming solution can be readily
computed from the parameter values and the number of times each item
has been presented, n_1 and n_2. From (17), the LIG for item i will be
a_i^{n_i} (1 - a_i).

Another approach which will be useful later involves finding the
optimal proportion of the available presentations which should be
assigned to item i. Assuming known parameters and letting N equal the
available presentations, of which k are assigned to item 1, for the
2-item case we can write

    E(q) = a_1^k + a_2^{N-k}.        (23)

Differentiating (23) with respect to k and solving for the minimum
yields the following result for the optimal proportion of presentations
to be assigned to item 1,

    \frac{k_{opt}}{N} = \frac{1}{\log a_1 + \log a_2}
        \left[ \log a_2 + \frac{1}{N} \log \frac{\log a_2}{\log a_1} \right].        (24)
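
The allocation result can be checked numerically. The sketch below
assumes the two-item objective a_1^k + a_2^(N-k) of (23), sets its
derivative with respect to k to zero, and compares the resulting
continuous optimum with a brute-force search over whole numbers of
presentations; the function names and example parameters are
illustrative assumptions.

```python
import math


def expected_q(a1, a2, n_total, k):
    """Objective of (23): k presentations to item 1, n_total - k to item 2."""
    return a1 ** k + a2 ** (n_total - k)


def optimal_k(a1, a2, n_total):
    """Continuous minimizer of expected_q; assumes 0 < a1 < 1 and 0 < a2 < 1.

    Setting the derivative to zero gives a1^k ln(a1) = a2^(N-k) ln(a2).
    The value is not clamped, so it can fall outside [0, n_total] for
    very unequal parameters or very small n_total.
    """
    l1, l2 = math.log(a1), math.log(a2)
    return (n_total * l2 + math.log(l2 / l1)) / (l1 + l2)


def best_integer_k(a1, a2, n_total):
    """Brute-force check over k = 0, 1, ..., n_total."""
    return min(range(n_total + 1), key=lambda k: expected_q(a1, a2, n_total, k))


if __name__ == "__main__":
    a1, a2, n_total = 0.4, 0.95, 16
    print(optimal_k(a1, a2, n_total), best_integer_k(a1, a2, n_total))
```

Because the objective is convex in k, the best whole-number allocation
is always one of the two integers adjacent to the continuous optimum.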



The problem is more difficult when, as is generally true, a priori
estimates of the learning rates do not exist. A moments estimate of a_i
can be readily obtained from T_{i,n}, the expected total errors for
item i in n presentations of the item, since

    T_{i,n} = \sum_{j=0}^{n-1} a_i^j = \frac{1 - a_i^n}{1 - a_i},        (25a)

implying that

    a_i = \frac{T_{i,n} - 1}{T_{i,n} - a_i^{n-1}}.        (25b)
As it turns out there are some practical problems in using (25b) for
optimization. The power of a_i in the denominator will be close to zero
for reasonably large values of n. If n is small and 1 - a_i is large
(fast learning), then (25b) leads to an estimate of a_i biased toward
0, so that learning may be slower than estimated. An additional problem
is that T_{i,n} is an expected value over many items, while the
available datum for estimation in an optimization schedule must be
based on a single item. Accordingly, an estimate of a_i based on (25b)
might be quite bad. It appears that the worst case arises if by chance
a student guesses correctly during the first one or two presentations
of a difficult item. Then a_i will be estimated to be 0, and the item
is unlikely to be presented again. A practical answer to this problem
has already been suggested, viz., schedule tests on a regular and
independent basis.
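
The estimation problem can be made concrete with a small sketch. It
assumes that the expected total errors in n presentations is the
geometric sum 1 + a + ... + a^(n-1) (that is, an initial error
probability of 1), and it solves for a by bisection rather than through
a closed-form expression; both the assumption and the function names
are illustrative and are not taken from the original report.

```python
def expected_errors(a, n):
    """Expected total errors in n presentations, assuming initial error probability 1."""
    return sum(a ** j for j in range(n))


def estimate_a(total_errors, n, tol=1e-10):
    """Moments-style estimate of a from an observed error count on one item.

    Counts of one or fewer errors give an estimate of 0, which is the
    degenerate case discussed in the text.
    """
    if total_errors <= 1.0:
        return 0.0
    if total_errors >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    # expected_errors is increasing in a on [0, 1], so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_errors(mid, n) < total_errors:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, an item with a single observed error in five presentations
is assigned an estimate of 0 regardless of its true difficulty, which
is the trap that regularly scheduled tests are meant to avoid.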

Another solution is to apply the dynamic programming solution
probabilistically rather than deterministically. Recall that the
optimal decision strategy was to choose that item for which
a_i^{n_i} (1 - a_i) is greatest. Suppose the gain, relative to the rest
of the items,

    \frac{a_i^{n_i} (1 - a_i)}{\sum_{j=1}^{M} a_j^{n_j} (1 - a_j)},        (26)

is computed. If a_i is estimated as 0, an arbitrary minimum value
(e.g., a constant or 1/n_i) is substituted for a_i. On each
presentation, the probability that item i is selected will be given by
(26). The probabilistic choice means that even those items that have
been tentatively identified as easy will be presented occasionally,
rather than being "trapped."
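
A probabilistic selection rule of this kind might be sketched as
follows. The floor of .05 substituted for a zero estimate, the matching
upper clamp that keeps every weight positive, and the function names
are all illustrative assumptions.

```python
import random


def selection_probabilities(a_est, counts, floor=0.05):
    """Relative immediate gains as in (26), with estimates clamped to [floor, 1 - floor]."""
    gains = []
    for a, n in zip(a_est, counts):
        a = min(max(a, floor), 1.0 - floor)   # avoid zero weights at a = 0 or a = 1
        gains.append((a ** n) * (1.0 - a))
    total = sum(gains)
    return [g / total for g in gains]


def choose_item(a_est, counts, rng, floor=0.05):
    """Draw one item index with probability proportional to its relative gain."""
    probs = selection_probabilities(a_est, counts, floor)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    a_est = [0.0, 0.6, 0.9]     # the first item was (perhaps wrongly) judged easy
    counts = [2, 2, 2]
    print(selection_probabilities(a_est, counts))
    print([choose_item(a_est, counts, rng) for _ in range(10)])
```

Because every item keeps a nonzero selection probability, an item whose
estimate collapsed to 0 after a lucky guess can still be presented and
its estimate revised.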

A second series of simulation runs was carried out to evaluate various
decision strategies, assuming either the all-or-none or incremental
model, and using heterogeneous parameter sets. The three decision
strategies previously described were investigated, together with a
fourth strategy, the probabilistic solution just described, based on
running estimates of a_i for each item. Both the incremental and
all-or-none models were simulated. Five items in the list were assigned
a learning rate parameter of c_1 (or 1 - a_1), and the other five were
assigned a parameter of c_2. The number of available presentations was
set at 80. The entries under "Known Parameters" give the expected state
values if the available presentations were distributed optimally using
(24).

The results of these studies are pre- 
sented in Table 2. For the all-or-none model, 
the dropout procedure is best; the standard 
procedure, worst; and the other two strategies 
fall between. As was suggested earlier, the 
dropout procedure handles parameter hetero- 
geneity with no complications. 

For the incremental model, two questions seem pertinent: (a) When
heterogeneity of learning rates exists, how much gain improvement is
possible by optimal scheduling based on a priori estimates of the
parameters? and (b) When the parameters are unknown, how close do
various presentation strategies come to achieving the potential gain in
efficiency? The answer to the first question, at least for the
parameter sets in Table 2, is that the potential gain is relatively
small. Only when some items are learned extremely fast (one or two
trials) and other items quite slowly does there seem to be any
substantial difference between strategies. As to the second question,
only for the most extreme case (a_1 = .4, a_2 = .95) does the choice of
strategy appear to matter; otherwise the various decision strategies
seem equally effective. In particular, the dropout procedure seems as
efficient as the more elaborate procedure based on the use of running
estimates of the a_i to calculate the immediate gain for each item.

It may be helpful to summarize the major 
points in this section on dynamic strategies. 
The derivation of optimal presentation tech- 
niques based on backward induction was in- 
troduced, and it was shown that for certain 
criteria of optimality and a general class of 
learning models, presenting that item which 
yielded the maximum immediate gain was an
optimal strategy. From this result, it was 
possible to show that a dropout procedure 
was optimal for an all-or-none learning model, 
whether the learning rate was the same for 
all items in the list or not. If an incremental 
model was assumed, then the standard pro- 
cedure was optimal if all items were learned 
at the same rate; otherwise, a more complex 
decision rule was required based on running 
estimates of the learning rates and calculation 
of largest immediate gain. Simulation studies 
suggested that for the case of heterogeneous 
learning rates the dropout procedure performed 
as well as the more complex strategy above. 
Table 2. Mean State Values at End of 100 Presentations, 10-Item List,
         for Various Decision Strategies, All-or-None and Incremental
         Models with Heterogeneity of Parameters

                                   Decision Strategy
                  ---------------------------------------------------------
  Model                                Running     Error        Known
  parameters      Standard   Dropout   Estimate    Proportion   Parameters

  All-or-none
   c_1    c_2
   .60    .05       .31        .14       .23         .20          --
   .30    .10       .20        .01       .12         .11          --
   .40    .05       .29        .18       .28         .26          --
   .15    .05       .40        .28       .32         .38          --
   .20    .025      .41        .36       .42         .40          --

  Incremental
   a_1    a_2
   .40    .95       .30        .25       .26         .26          .24
   .70    .90       .19        .19       .19         .19          .15
   .60    .95       .30        .29       .29         .29          .26
   .85    .95       .40        .44       .41         .41          .40
   .80    .97       .44        .50       .45         .46          .41

In general, only slight gains in efficiency were achieved by taking
into account heterogeneity in learning rates, assuming an incremental
model.

Under certain conditions, the gain in performance from the dropout
technique for a fixed number of presentations was substantial. However,
in many other cases, the gain was negligible. More consistent outcomes
were found by assuming that the training session was continued until
some proportion x of the items had reached a performance criterion. The
relative efficiency of the dropout procedure under these conditions
depended only on x, and for reasonable values of x the gain was
considerable.



In this connection, it might be useful to 
point out some desirable features for experi- 
ments designed to evaluate the effectiveness 
of these two procedures. First, as just men-
tioned, teaching should continue to a list cri-
terion. Second, the probability of chance 
successes should be relatively small, since 
otherwise the Bayesian estimate of the prob- 
ability that an item has been learned is not 
very reliable. Third, each item should be 
tested sufficiently often so that correct re- 
sponses arising from short-term retention do 
not carry much weight in the decision of the 
teaching system. 
VI

CONCLUSIONS AND IMPLICATIONS



The theoretical analyses of optimality 
discussed above have centered around a 
stimulus-response procedure of teaching — a 
conceptually simple paradigm and one of the 
few for which reasonably adequate models 
have been developed. A wide class of in- 
structional problems can be expressed in 
this framework. More importantly, the theo- 
retical models considered span the range of 
assumptions about how learning occurs from 
the "rote-learning" incremental model (repre- 
sentative of the Thorndikian establishment 
of connections by sheer repetition) to the 
"insight" all-or-none model based on active 
search for mnemonic links. The models con- 
sidered are oversimplifications, but in vari- 
ous situations they provide good approxima- 
tions to the learning process, approximations 
which are mathematically tractable. Because 
mathematical analysis rapidly becomes more 
difficult as one complicates a model, it would 
seem a reasonable strategy to press these 
simple representations as far as possible. 

It is my suspicion that from the standpoint of 
optimization the most serious problem with 
these models is that no provision is made for 
interitem interference — for the fact that in 
learning one item, the student may form an 
association that makes it difficult to retrieve 



other associations that were previously formed. 
This interference becomes more important with 
increased similarity among items in a list. 

In practical applications, as list length in- 
creases the amount of interitem similarity 
also tends to increase. 

It is apparent that the psychological 
processes used by a student in learning asso- 
ciations between discrete stimuli and re- 
sponses — no matter how complex either the 
stimulus or response terms may be — represent 
only a fraction of the cognitive skills in the 
student's repertoire. Very likely, different 
cognitive processes are operative when the 
student is asked to learn conceptual relations 
(e.g., physics), remember a body of facts in 
an organized or structural fashion (e.g., 
American history), or become skillful at ap- 
plying transformations of various sorts (e.g., 
algebra). The examples are for illustration 
only. In any curricular area such as reading 
or arithmetic, many different cognitive skills 
are needed. A promising approach to achiev- 
ing instructional efficiency would be to par- 
tition an area according to the requisite skills 
and seek to optimize the instructional system 
with respect to the development of those 
specific skills, as in the examples in this 
paper. 
APPENDIX A

EVALUATION OF BLOCK-SIZE FUNCTION 



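The following proof was suggested by J. C. Falmagne. Define the two functions

$$\phi_N(k) = \frac{1 - b^{kN} v^N}{1 - b^k v}, \qquad \psi_N(k) = \frac{1 - b^{kN}}{1 - b^k}. \tag{A1}$$

The last two terms in (8) are equal to the ratio of these two functions, and the
problem is to show that $v < 1$ implies $E(q)$ decreases monotonically with $k$,
and $v > 1$ implies $E(q)$ increases with $k$. Equivalently, it must be shown that
for all $\ell > 0$

$$\frac{\psi_N(k)}{\phi_N(k)} \ge \frac{\psi_N(k+\ell)}{\phi_N(k+\ell)}. \tag{A2}$$

Notice that (A2) holds if $N = 1$, since $\phi_1(k) = \psi_1(k) = 1$ for all $k$. We
show that (A2) is true for all $N$ by induction for $v < 1$; the proof for $v > 1$ is
similar. Suppose that (A2) holds for some $N$. It must be shown to hold for $N + 1$,
viz.,

$$\frac{\psi_{N+1}(k)}{\phi_{N+1}(k)} \ge \frac{\psi_{N+1}(k+\ell)}{\phi_{N+1}(k+\ell)}. \tag{A3}$$

Both $\phi$ and $\psi$ can be expressed in series form,

$$\phi_N(k) = \sum_{i=0}^{N-1} b^{ki} v^i, \qquad \psi_N(k) = \sum_{i=0}^{N-1} b^{ki}, \tag{A4}$$

so that (A3) is equivalent to

$$\frac{\psi_N(k) + b^{kN}}{\phi_N(k) + b^{kN} v^N} \ge \frac{\psi_N(k+\ell) + b^{(k+\ell)N}}{\phi_N(k+\ell) + b^{(k+\ell)N} v^N}. \tag{A5}$$

Crossmultiplying and cancelling common terms, (A5) becomes

$$\phi_N(k+\ell)\,\psi_N(k) + b^{kN}\phi_N(k+\ell) + b^{(k+\ell)N} v^N \psi_N(k) \ge \phi_N(k)\,\psi_N(k+\ell) + b^{(k+\ell)N}\phi_N(k) + b^{kN} v^N \psi_N(k+\ell). \tag{A6}$$

From the inductive hypothesis,

$$\frac{\psi_N(k)}{\phi_N(k)} \ge \frac{\psi_N(k+\ell)}{\phi_N(k+\ell)} \;\Longrightarrow\; \phi_N(k+\ell)\,\psi_N(k) \ge \phi_N(k)\,\psi_N(k+\ell), \tag{A7}$$

hence the crossproducts of $\phi$ and $\psi$ may be eliminated without prejudice to the
hypothesis. Rearranging terms and expressing $\phi$ and $\psi$ as series, (A6) will be
true if

$$\phi_N(k+\ell) - v^N \psi_N(k+\ell) \ge b^{\ell N}\{\phi_N(k) - v^N \psi_N(k)\} \tag{A8}$$

or

$$\sum_{i=0}^{N-1} v^i \{b^{(k+\ell)i} - b^{\ell N + ki}\} \ge v^N \sum_{i=0}^{N-1} \{b^{(k+\ell)i} - b^{\ell N + ki}\}. \tag{A9}$$

However, $b < 1$ and $i < N$, and so $b^{ki}(b^{\ell i} - b^{\ell N}) > 0$. Moreover, $v < 1$,
hence $v^i > v^N$ for each $i$, and thus for each pair of terms in the series the
inequality is true.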
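
The monotonicity argument of this appendix can be spot-checked numerically. The sketch below is only an illustration under stated assumptions: it takes the two functions in their series form, $\phi_N(k) = \sum_{i<N} b^{ki} v^i$ and $\psi_N(k) = \sum_{i<N} b^{ki}$, and tabulates the ratio $\psi_N(k)/\phi_N(k)$ over $k$. The function names and the parameter values ($b = 0.8$, $v = 0.6$ and $1.4$) are hypothetical choices, not values from the paper.

```python
def phi(N, k, b, v):
    """Series form of phi_N(k): sum over i < N of (b**k * v)**i."""
    return sum((b ** k * v) ** i for i in range(N))


def psi(N, k, b):
    """Series form of psi_N(k): sum over i < N of (b**k)**i."""
    return sum((b ** k) ** i for i in range(N))


def ratio(N, k, b, v):
    """Ratio psi_N(k) / phi_N(k) whose behavior in k is at issue."""
    return psi(N, k, b) / phi(N, k, b, v)


def is_nonincreasing(values, tol=1e-12):
    """True if each value is no larger than its predecessor (up to tol)."""
    return all(later <= earlier + tol
               for earlier, later in zip(values, values[1:]))


if __name__ == "__main__":
    b = 0.8  # illustrative value with 0 < b < 1 (assumption)
    for v in (0.6, 1.4):  # one value below 1, one above (assumptions)
        for N in (1, 2, 5, 10):
            ratios = [ratio(N, k, b, v) for k in range(1, 8)]
            print(f"v={v} N={N}", [round(r, 4) for r in ratios])
```

For $v < 1$ the printed ratios should fall (or stay constant when $N = 1$) as $k$ grows, and for $v > 1$ they should rise, which is the pattern the induction establishes.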
APPENDIX B 

DERIVATION OF A POSTERIORI PROBABILITY 
OF STATE L IN ALL-OR-NONE MODEL 



Let $K$ be the occurrence of $k$ successes since the last error, $U$ be the
event that the item is in the unlearned state $U$ after $K$, and let $X$ represent
the relevant features of the learning history, viz., the last error and the $k + 1$
study trials (one following the last error, $k$ following the $k$ successes). Then
the task is to find $\Pr(U \mid K \text{ and } X)$, which may be rewritten by Bayes'
formula as

$$\Pr(U \mid K \text{ and } X) = \frac{\Pr(K \mid U \text{ and } X)\,\Pr(U \mid X)}{\Pr(K \mid X)}. \tag{B1}$$

Considering the terms one at a time, $\Pr(K \mid U \text{ and } X)$, since it is
conditional on $U$, is the likelihood that $k$ successes occur by chance, $(1-q)^k$,
where $q$ is the probability of an error in the unlearned state. The second term,
$\Pr(U \mid X)$, is $(1 - c)^{k+1}$, since the student must fail to learn the item on
each of the $k + 1$ opportunities. The denominator, $\Pr(K \mid X)$, must be broken
into two parts, depending on whether the item was learned during one of the $k + 1$
study trials, or not learned at all. Accordingly

$$\begin{aligned}
\Pr(K \mid X) &= \Pr(K \mid U \text{ and } X)\,\Pr(U \mid X) + \sum_{i=1}^{k+1} \Pr(K \mid L \text{ on } i \text{ and } X)\,\Pr(L \text{ on } i \mid X) \\
&= (1-q)^k (1-c)^{k+1} + \sum_{i=1}^{k+1} c\,\{(1-c)(1-q)\}^{i-1} \\
&= (1-q)^k (1-c)^{k+1} + \frac{c\,[1 - \{(1-c)(1-q)\}^{k+1}]}{1 - (1-c)(1-q)}.
\end{aligned} \tag{B2}$$

Further simplification yields the final result,

$$\Pr(U \mid K \text{ and } X) = \frac{\{(1-c)(1-q)\}^{k+1} - \{(1-c)(1-q)\}^{k+2}}{c(1-q) + q\,\{(1-c)(1-q)\}^{k+1}}. \tag{B3}$$

The derivative of (B3) with regard to $k$ is negative, which means that
$\Pr(U \mid K \text{ and } X)$ decreases monotonically as $k$ increases. In other words,
the item with the fewest successes since the last error has the highest probability
of being in $U$.

It is obvious that if $q = 1$, then $\Pr(U \mid K \text{ and } X) = 0$ for $k \ge 1$;
i.e., if the guessing rate is zero, a single success is sufficient to imply that
learning has occurred. Inspection of (B3) suggests that as $q$ decreases to .5,
$\Pr(U \mid K \text{ and } X)$ will increase. Thus, the higher the guessing rate, the
less certain one can be that the process is in $L$ for any fixed value of $k$.
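
The derivation of this appendix can be checked by computing the a posteriori probability two ways: directly from Bayes' formula with the explicit sum over the study trial on which learning occurred, and from the simplified closed form. The sketch below assumes the closed form $\{r^{k+1} - r^{k+2}\} / \{c(1-q) + q\,r^{k+1}\}$ with $r = (1-c)(1-q)$, where $q$ is the error probability in the unlearned state and $c$ the learning probability per study trial. The function names and parameter values are hypothetical.

```python
def posterior_unlearned_bayes(k, q, c):
    """Pr(U | K and X) by Bayes' formula with the explicit sum.

    k: successes since the last error; q: error probability while unlearned;
    c: probability of learning on a single study trial.
    """
    # Item never learned on any of the k + 1 study trials, k chance successes.
    not_learned = (1 - q) ** k * (1 - c) ** (k + 1)
    # Item learned on study trial i: i - 1 failures to learn and
    # i - 1 chance successes before it, certain successes afterwards.
    learned = sum(c * ((1 - c) * (1 - q)) ** (i - 1) for i in range(1, k + 2))
    return not_learned / (not_learned + learned)


def posterior_unlearned_closed(k, q, c):
    """Assumed simplified closed form of the same probability."""
    r = (1 - c) * (1 - q)
    return (r ** (k + 1) - r ** (k + 2)) / (c * (1 - q) + q * r ** (k + 1))


if __name__ == "__main__":
    q, c = 0.5, 0.05  # illustrative parameter values (assumptions)
    for k in range(6):
        print(k,
              round(posterior_unlearned_bayes(k, q, c), 4),
              round(posterior_unlearned_closed(k, q, c), 4))
```

The two columns should agree for every $k$, and both should fall as $k$ grows, in line with the statement that the item with the fewest successes since the last error is the most likely to be unlearned.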



APPENDIX C 

CONDITIONS FOR OPTIMALITY OF 
LARGEST-IMMEDIATE-GAIN STRATEGY 



Suppose that a student is to be taught a list of M stimulus-response pairs 
and only N presentations are available. Prior to each presentation, the instructor 
chooses a pair from the list to present to the student. The optimization problem 
of concern here involves finding a decision procedure for selecting items such 
that the expected number of correct responses on a final test (or some monotonic 
function thereof) is maximized. The performance level on the final test will be 
referred to as the payoff. 

The optimum decision strategy is a function of the history and gain function 
for each pair. A history, for purposes of this paper, is the sequence of correct 
and incorrect responses for an item up to some point in the presentation sequence. 
The gain function defines the expected increment in payoff which accrues if a par- 
ticular item is presented. In the present discussion, the payoff is measured by 
the number of correct responses on the final test. Accordingly, the gain function 
will be some function of the increase in the probability of a correct response ob- 
tained by presenting an item. 

The relation between payoff and gain may be further clarified by looking at 
the decision problem from a different angle. Suppose that $k$ presentations remain 
to be made at some point. This portion of the presentation sequence can be 
represented by a tree diagram with $k$ levels, laying out the $M^k$ different 
sequences which might occur. Given the history at $k$ and specification of a 
learning model, the expected payoff for each sequence may be calculated. One 
solution of the optimization problem is to calculate all possible outcomes prior 
to each decision, and present the item which leads to the highest payoff. This 
approach is obviously unworkable unless $M$ and $k$ are both quite small. A major 
concern is to discover simpler and more efficient means than enumeration of 
arriving at optimal decisions. 

It will be recalled that gain is defined as the expected increment in payoff 
obtained by presentation of an item at some point in the sequence. It is not 
generally true that selection of that item which produces the largest expected 
gain on a presentation is an optimal choice in terms of the expected payoff. As a 
simple counterexample, suppose a list consists of two items, A and B, and there 
are two presentations remaining. Suppose that the probabilities of a correct 
response following one and two presentations for A are .20 and .40, respectively, 
and for B are .10 and .60. Suppose further that the payoff is the probability of 
a correct response following the two presentations. The possible outcomes can 
be represented by a tree diagram: 



First      Expected   Second     Expected   Final
Decision   Gain       Decision   Gain       Payoff

A          .20        A          .20        .40
                      B          .10        .30

B          .10        A          .20        .30
                      B          .50        .60

By this enumeration it can be seen that the optimal strategy is to present B twice 
in succession, whereas if gain were being maximized, one would present A twice 
in succession. A major purpose of this section is to show that under certain 
conditions maximization of immediate gain is also an optimal decision. 
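
The two-item counterexample can be reproduced by brute-force enumeration. The sketch below is a minimal illustration, not a general procedure: it uses the response probabilities given in the text for A and B, assumes a correct-response probability of zero before any presentation (consistent with the first-decision gains of .20 and .10), and compares full enumeration with a largest-immediate-gain rule. All function and variable names are hypothetical.

```python
from itertools import product

# Probability of a correct response after 0, 1, and 2 presentations.
# The values for 1 and 2 presentations come from the example in the text;
# the value 0.0 for zero presentations is an assumption.
P_CORRECT = {"A": [0.0, 0.20, 0.40], "B": [0.0, 0.10, 0.60]}


def payoff(sequence):
    """Summed probability of a correct response over items after the sequence."""
    return sum(P_CORRECT[item][sequence.count(item)] for item in P_CORRECT)


def best_by_enumeration(n_presentations):
    """Evaluate every possible presentation sequence and keep the best one."""
    return max(product(P_CORRECT, repeat=n_presentations), key=payoff)


def greedy_sequence(n_presentations):
    """Always present the item with the largest immediate expected gain."""
    counts = {item: 0 for item in P_CORRECT}
    chosen = []
    for _ in range(n_presentations):
        def gain(item):
            seen = counts[item]
            return P_CORRECT[item][seen + 1] - P_CORRECT[item][seen]
        pick = max(P_CORRECT, key=gain)
        counts[pick] += 1
        chosen.append(pick)
    return tuple(chosen)


if __name__ == "__main__":
    for seq in product(P_CORRECT, repeat=2):
        print(seq, round(payoff(seq), 2))
    print("enumeration:", best_by_enumeration(2))
    print("largest immediate gain:", greedy_sequence(2))
```

Enumeration visits $M^k$ sequences, which is why it is only practical for very small lists and short horizons.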

Assume that the learning process for each item can be expressed as a Markov 
process with observable states $S_i$. Let the gain function, $g_x(i,n)$, be the 
increment in the expected payoff of item $x$ in $S_i$ when it is presented on $n$; 
$g_x(i, n+1 \mid n)$ is the gain from presentation on $n + 1$, given item $x$ was 
also presented on $n$ and was in $S_i$ at that point. (If the anticipation procedure 
is used in the learning task, it is necessary to be quite explicit. The state of the 
item will mean its state prior to the test portion of the presentation.) 

The notation is unavoidably complicated by the need at various times to take 
note of the particular item, the state of the item, and the position in the 
presentation sequence. Where it is of no consequence, the item subscript will be 
dropped. The index $n$ is especially troublesome. It may be used to indicate either 
of two related numbers: (a) position in the presentation sequence counting forward 
from the first decision, denoted by $n$, or (b) position in the sequence counting 
backward from the last decision, denoted by $k$. A largest-immediate-gain (LIG) 
decision is optimal under dynamic programming if for all items (C1) 
$g(i,n) \ge g(j,n)$ for $i < j$, and (C2) $g(i,n) \ge g(i, n+1 \mid n)$. The first 
condition requires that the states be arranged so that the expected gain decreases 
monotonically with the state index. For the second condition to be met, if an item 
in $S_i$ is presented on $n$ and also on $n + 1$, the gain on $n + 1$ must not be 
greater than the gain on $n$. If these conditions are met, the LIG decision is the 
choice of an item with the smallest current state value. 

The proof proceeds by induction, and follows a discussion in Matheson 
(1964, pp. 22-24) with changed notation. It is obvious that the LIG decision is 
optimal for the last decision in the sequence, because no further decisions 
remain to be made. For the inductive hypothesis, assume that with $k$ decisions 
remaining, the LIG decision is optimal. Suppose that with $k + 1$ decisions 
remaining, item $x$ yields LIG, and item $y$ is an optimal choice. The proof 
involves showing that on $k + 1$, $x$ is at least as good a choice as $y$. 

Recall that $g_x(i,k)$ is the expected gain from presenting item $x$ in state $i$ 
with $k$ decisions remaining. By assumption $g_x(i, k+1) \ge g_y(j, k+1)$. Moreover, 
by C2 $g_y(j, k+1) \ge g_y(j, k \mid k+1)$, and so 

$$g_x(i, k+1) \ge g_y(j, k+1) \ge g_y(j, k \mid k+1). \tag{1}$$

If $y$ is selected for presentation at $k + 1$ as the optimal choice, then 
$g_z(m,k) = g_z(m, k+1)$ for all $z \ne y$. In other words, there is no change in 
state and therefore no change in gain from $k + 1$ to $k$ except for the item 
presented at $k + 1$. In particular, $g_x(i,k) = g_x(i, k+1)$. Accordingly, $x$ is 
LIG at decision $k$, and by the inductive hypothesis is the optimal choice. Given 
the usual assumption that learning proceeds independently for each item, it makes 
no difference in what order $x$ and $y$ are presented. The sequences ($x$ on 
$k + 1$, $y$ on $k$) and ($y$ on $k + 1$, $x$ on $k$) are therefore equivalent, so 
that $x$ must also be optimal on $k + 1$, which completes the proof. 

The critical role of the gain function is apparent from the work of Karush and 
Dear (1966a) who showed that for arbitrary gain functions, a choice based on LIG 
may be far from optimal, and that more complex strategies based on gain accrued 
over two or more succeeding presentations may be necessary. 

Both the incremental and all-or-none models can be described by gain 
functions satisfying C1 and C2 (cf. Atkinson and Estes, 1963, for details on these 
learning models). In the incremental model, the states are indexed by the number 
of previous study presentations. The probability of a success following the $i$th 
study trial is $p_i = 1 - q a^i$, where $q$ is the initial error probability and $a$ 
is the learning rate parameter. A suitable gain function is the difference between 
$p_i$ and $p_{i-1}$, $q(1 - a)a^{i-1}$, which clearly satisfies both C1 and C2. 
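
The claim for the incremental model can be illustrated with a short numerical check. The sketch assumes the usual linear-model form $p_i = 1 - q a^i$ for the success probability after $i$ study trials, so that an item with $s$ previous study presentations yields a gain of $p_{s+1} - p_s = q(1-a)a^s$ from one more presentation. Because one presentation moves an item from state $s$ to state $s+1$, conditions C1 and C2 both reduce here to the gain being nonincreasing in $s$. The function names and parameter values are hypothetical.

```python
def p_success(i, q, a):
    """Assumed incremental-model success probability after i study trials."""
    return 1 - q * a ** i


def gain_in_state(s, q, a):
    """Gain from one more presentation of an item with s previous study trials."""
    return p_success(s + 1, q, a) - p_success(s, q, a)


def satisfies_c1_c2(q, a, n_states=20):
    """Check that the gain never increases with the state index.

    In this model the state after a presentation in state s is s + 1, so the
    gain on the next trial is gain_in_state(s + 1); C1 and C2 coincide.
    """
    gains = [gain_in_state(s, q, a) for s in range(n_states)]
    return all(g_now >= g_next for g_now, g_next in zip(gains, gains[1:]))


if __name__ == "__main__":
    q, a = 0.8, 0.7  # illustrative parameter values (assumptions)
    print([round(gain_in_state(s, q, a), 4) for s in range(6)])
    print(satisfies_c1_c2(q, a))
```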



It requires a bit more work to handle the all-or-none model, since it must 
be rearranged in the form of an identifiable state model and a suitable gain 
function found. The underlying model is a two-state Markov process. In the 
unlearned state, the probability of an error is $q$; on each trial, there is a 
probability $c$ of transition to the learned state. The learned state is an 
absorbing state, in which the probability of an error is 0. An appropriate Markov 
description is obtained by letting the states be indexed by the number of 
successes since the last error. For derivational purposes, it is easiest to work 
with $Q_i$, the probability that an item is in the unlearned state given $i$ 
successes since the last error, which is (Appendix B) 

$$Q_i = \frac{\{(1-c)(1-q)\}^{i+1} - \{(1-c)(1-q)\}^{i+2}}{c(1-q) + q\,\{(1-c)(1-q)\}^{i+1}}. \tag{2}$$

The all-or-none model is a recurrent process, and is not dependent on 
either the presentation number $n$ or how many times an item has been presented. 
Moreover, it is simple to show that $Q_i$ decreases monotonically with $i$, i.e., 
that $Q_i > Q_{i+1}$. 

The gain function $g(i,n)$ will be defined as the (expected) decrement in the 
probability that an item is unlearned, given that the item is in $S_i$ just prior 
to being chosen for a test-study sequence on the $n$th presentation. The decrement 
depends on the outcome of the test. With probability $qQ_i$, an error will occur, 
and the item will be identified as being unlearned with probability 1. The effect 
of the study event is to reduce this probability to $1 - c$, and accordingly the 
gain from study immediately following an error is $c$. With probability 
$1 - qQ_i$, a success will be observed, the a posteriori probability will change 
to $Q_{i+1}$, and the gain from the study event will be $cQ_{i+1}$. The 
anticipation procedure necessitates careful consideration of the sequencing of 
events. In particular, the gain depends on the outcome of the test portion of a 
trial and the immediately following study portion, as does the determination of 
the state $S_i$ of an item. The relation between the events of a trial, the gain 
associated with those events, and the various indices is shown on the next page 
for an item which is presented for two trials in succession. 

The expected immediate gain from presentation of an item in state $i$, 
$g(i,n)$, is just 

$$g(i,n) = c\,[(1 - qQ_i)Q_{i+1} + qQ_i]. \tag{3}$$

Straightforward algebra shows that (3) meets C1, i.e., that $g(i,n) > g(i+1,n)$: 

$$(1 - qQ_i)Q_{i+1} + qQ_i > (1 - qQ_{i+1})Q_{i+2} + qQ_{i+1},$$

$$(1 - qQ_{i+1})(Q_i - Q_{i+2}) > (1 - q)(Q_i - Q_{i+1}). \tag{4}$$

But clearly, $(1 - qQ_{i+1}) > (1 - q)$, and since $Q_i$ decreases with $i$, 
$(Q_i - Q_{i+2}) > (Q_i - Q_{i+1})$. 
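
Conditions C1 and C2 for the all-or-none gain function can also be checked numerically. The sketch below assumes the closed form for $Q_i$ from Appendix B with $r = (1-c)(1-q)$ and the gain $g(i) = c[(1 - qQ_i)Q_{i+1} + qQ_i]$; for C2 it compares $g(i)$ with the expected gain on the next trial, which is $g(i+1)$ after a success (probability $1 - qQ_i$) and $g(0)$ after an error (probability $qQ_i$). Function names and the parameter grid are hypothetical.

```python
def q_unlearned(i, q, c):
    """Assumed closed form for Q_i: Pr(unlearned | i successes since last error)."""
    r = (1 - c) * (1 - q)
    return (r ** (i + 1) - r ** (i + 2)) / (c * (1 - q) + q * r ** (i + 1))


def gain(i, q, c):
    """Expected immediate gain from presenting an item in state i."""
    q_now, q_next = q_unlearned(i, q, c), q_unlearned(i + 1, q, c)
    return c * ((1 - q * q_now) * q_next + q * q_now)


def gain_after_repeat(i, q, c):
    """Expected gain on trial n + 1 given presentation in state i on trial n."""
    p_error = q * q_unlearned(i, q, c)
    # Success moves the item to state i + 1; an error resets it to state 0.
    return (1 - p_error) * gain(i + 1, q, c) + p_error * gain(0, q, c)


def check_conditions(q, c, n_states=15):
    """Return (C1 holds, C2 holds) over the first n_states states."""
    c1 = all(gain(i, q, c) > gain(i + 1, q, c) for i in range(n_states))
    c2 = all(gain(i, q, c) >= gain_after_repeat(i, q, c) for i in range(n_states))
    return c1, c2


if __name__ == "__main__":
    for q in (0.3, 0.5, 0.9):      # illustrative error probabilities (assumptions)
        for c in (0.05, 0.2, 0.6):  # illustrative learning rates (assumptions)
            print(q, c, check_conditions(q, c))
```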







Demonstrating that (3) satisfies C2 requires a little more care. From the 
two-trial sequence presented above, it can be seen that the problem is to show 
that 

$$(1 - qQ_i)Q_{i+1} + qQ_i \ge (1 - qQ_i)\,[(1 - qQ_{i+1})Q_{i+2} + qQ_{i+1}] + qQ_i\,[(1 - qQ_0)Q_1 + qQ_0]. \tag{5}$$

The rightmost terms involving $qQ_i$ are easily shown to satisfy the inequality. 
The left hand terms in $(1 - qQ_i)$ must be changed to a form such as 

$$(1 - q)Q_{i+1} \ge (1 - qQ_{i+1})Q_{i+2}$$

and the appropriate substitutions made from (2) in order to see that the result 
is equivalent to $c(1 - q) > 0$, which clearly holds. 

Thus for the all-or-none model, there is an observable state representation 
and a gain function satisfying C1 and C2 which are appropriate for the adoption 
of an LIG strategy. It might be mentioned that there exist other plausible gain 
functions for this same representation of the model which do not satisfy C1 or C2. 
For example, in the initial stages of working on this problem, the gain was 
defined as the expected decrement in $Q_i$, the a posteriori probability of being 
in the unlearned state given $i$ successes since the last error. This gain function 
does not satisfy either C1 or C2, and hence the LIG strategy is inappropriate. 
However, on reflection it is apparent that the function is not appropriate for 
another, more pertinent reason. The gain represented by this function relates to 
the increase in the teacher's certainty that learning has occurred, not in the 
probability that learning actually takes place. For example, suppose that $q$ 
is .5, and that $c$ is quite small. The likelihood of chance success is so high 
that strings of up to three successes are not very diagnostic of the learning 
state; in other words, $Q_i$ decreases slowly for $0 \le i \le 3$. Longer strings 
quickly become improbable under the hypothesis that the item is unlearned, and so 
$Q_i$ decreases rapidly. Under these conditions of the parameters and for this 
particular gain function, the optimal choice is to choose an item with a string of 
two or three successes; if another success occurs, there will be a big jump in 
the a posteriori probability that the item is learned. (One might think of this 
gain function as suitable for the type of individual who turns to the back of a 
mystery novel to find out "who done it?") In summary, the demands on the model 
in this approach are relatively minimal, but the selection of a proper gain 
function is critical. 
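
The behavior of the rejected gain function can be illustrated numerically. The sketch below takes one possible reading of "decrement in $Q_i$", namely the drop $Q_i - Q_{i+1}$ produced by one further success; this reading, the closed form used for $Q_i$, and the value $c = .05$ standing in for "quite small" are all assumptions made for illustration.

```python
def q_unlearned(i, q, c):
    """Assumed closed form for Q_i: Pr(unlearned | i successes since last error)."""
    r = (1 - c) * (1 - q)
    return (r ** (i + 1) - r ** (i + 2)) / (c * (1 - q) + q * r ** (i + 1))


def drop_after_success(i, q, c):
    """Decrease in the a posteriori unlearned probability after one more success."""
    return q_unlearned(i, q, c) - q_unlearned(i + 1, q, c)


if __name__ == "__main__":
    q, c = 0.5, 0.05  # q from the text; c is an assumed "quite small" value
    for i in range(8):
        print(i,
              round(q_unlearned(i, q, c), 4),
              round(drop_after_success(i, q, c), 4))
```

Under these assumed values the drop is not largest in state 0; it peaks for items with a string of two or three successes, so a rule that maximizes this quantity would not select the item with the smallest state index.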
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